18 research outputs found
BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models
Self-supervised techniques for learning speech representations have been
shown to develop linguistic competence from exposure to speech without the need
for human labels. In order to fully realize the potential of these approaches
and further our understanding of how infants learn language, simulations must
closely emulate real-life situations by training on developmentally plausible
corpora and benchmarking against appropriate test sets. To this end, we propose
a language-acquisition-friendly benchmark to probe spoken language models at
the lexical and syntactic levels, both using a vocabulary compatible with
the typical language experiences of children. This paper introduces
the benchmark and summarizes a range of experiments showing its usefulness. In
addition, we highlight two exciting challenges that need to be addressed for
further progress: bridging the gap between text and speech, and between clean
speech and in-the-wild speech.
Comment: Proceedings of Interspeech 2023
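A lexical probe of the kind described above can be sketched as a real-word versus pseudo-word discrimination task, scored by accuracy against a 50% chance level. The scorer below is a hypothetical stand-in (a character-bigram model trained on a toy child-vocabulary list), not the benchmark's actual spoken language models:

```python
# Sketch of a lexical "spot-the-word" probe: the model should assign a
# higher score to a real word than to a matched pseudo-word.
from collections import Counter
import math

def train_bigram(words):
    """Train a Laplace-smoothed character-bigram scorer on a word list."""
    counts, ctx = Counter(), Counter()
    for w in words:
        s = f"^{w}$"  # word-boundary markers
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
            ctx[a] += 1
    def score(w):
        s = f"^{w}$"
        return sum(math.log((counts[(a, b)] + 1) / (ctx[a] + 27))
                   for a, b in zip(s, s[1:]))
    return score

# Toy child-directed vocabulary (illustrative, not the benchmark's corpus).
score = train_bigram(["dog", "cat", "ball", "milk", "baby", "book", "duck", "cup"])
pairs = [("dog", "dzg"), ("ball", "bxll"), ("milk", "mlkk")]
acc = sum(score(real) > score(pseudo) for real, pseudo in pairs) / len(pairs)
print(acc)
```

A model with no lexical knowledge would hover around 0.5 on such pairs; above-chance accuracy is what the benchmark measures.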
Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation
Most automatic speech processing systems are sensitive to the acoustic
environment, with degraded performance when applied to noisy or reverberant
speech. But how can one tell whether speech is noisy or reverberant? We propose
Brouhaha, a pipeline to simulate audio segments recorded in noisy and
reverberant conditions. We then use the simulated audio to jointly train the
Brouhaha model for voice activity detection, speech-to-noise ratio estimation,
and C50 room acoustics prediction. We show how the predicted SNR and C50 values
can be used to investigate and help diagnose errors made by automatic speech
processing tools (such as pyannote.audio for speaker diarization or OpenAI's
Whisper for automatic speech recognition). Both our pipeline and a pretrained
model are open source and shared with the speech community.
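The C50 value the model predicts is a standard room-acoustics clarity index: the ratio, in dB, of early (first 50 ms) to late energy in the room impulse response. As a rough illustration of what is being estimated (this is not the Brouhaha code, which predicts C50 directly from the speech signal):

```python
import numpy as np

def c50(impulse_response: np.ndarray, sample_rate: int) -> float:
    """Clarity index C50: ratio (in dB) of energy arriving before 50 ms
    to energy arriving after 50 ms in a room impulse response."""
    split = int(0.050 * sample_rate)
    early = np.sum(impulse_response[:split] ** 2)
    late = np.sum(impulse_response[split:] ** 2)
    return 10.0 * np.log10(early / late)

# Synthetic exponentially decaying impulse response (a crude room model,
# 100 ms amplitude decay constant, 1 s long).
sr = 16000
t = np.arange(sr) / sr
rir = np.exp(-t / 0.1)
print(round(c50(rir, sr), 2))  # C50 ≈ 2.35 dB for this synthetic decay
```

Higher C50 means clearer, less reverberant speech; heavily reverberant rooms yield low or negative values.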
ProsAudit, a prosodic benchmark for self-supervised speech models
We present ProsAudit, a benchmark in English to assess structural prosodic
knowledge in self-supervised learning (SSL) speech models. It consists of two
subtasks, their corresponding metrics, and an evaluation dataset. In the
protosyntax task, the model must correctly identify strong versus weak prosodic
boundaries. In the lexical task, the model needs to correctly distinguish
between pauses inserted between words and within words. We also provide human
evaluation scores on this benchmark. We evaluated a series of SSL models and
found that they all performed above chance on both tasks, even when
evaluated on an unseen language. However, non-native models performed
significantly worse than native ones on the lexical task, highlighting the
importance of lexical knowledge in this task. We also found a clear effect of
size, with models trained on more data performing better on both subtasks.
Comment: Accepted at Interspeech 2023. 4 pages + references, 1 figure
Vocal markers from sustained phonation in Huntington's Disease
Disease-modifying treatments are currently assessed in neurodegenerative
diseases. Huntington's Disease represents a unique opportunity to design
automatic sub-clinical markers, even in premanifest gene carriers. We
investigated phonatory impairments as potential clinical markers and propose
them for both diagnosis and gene carriers' follow-up. We used two sets of
features: Phonatory features and Modulation Power Spectrum Features. We found
that phonation alone is not sufficient for identifying the sub-clinical
disorders of premanifest gene carriers. According to our regression results,
Phonatory features are suitable for predicting clinical performance in
Huntington's Disease.
Comment: To appear at INTERSPEECH 2020. 1 page of supplementary material
appears only in the arxiv version. Code to replicate:
https://github.com/bootphon/sustained-phonation-features
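Phonatory feature sets for sustained vowels typically include perturbation measures such as jitter. As an illustrative sketch (not necessarily the paper's exact feature extraction), local jitter can be computed from the pitch periods extracted from a sustained phonation:

```python
def local_jitter(periods):
    """Local jitter: mean absolute difference between consecutive pitch
    periods, divided by the mean period. A standard phonatory measure of
    cycle-to-cycle frequency instability."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

# Pitch periods in seconds from a sustained vowel (synthetic values here;
# in practice they come from a pitch tracker such as Praat's).
periods = [0.0100, 0.0102, 0.0099, 0.0101]
print(f"{local_jitter(periods):.2%}")  # ≈ 2.32%
```

Elevated jitter relative to matched controls is one way such features can signal phonatory impairment.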
A comparison study on patient-psychologist voice diarization
Conversations between a clinician and a patient, in natural conditions, are valuable sources of information for medical follow-up. The automatic analysis of these dialogues could help extract new language markers and speed up the clinicians' reports. Yet, it is not clear which model is the most efficient at detecting and identifying speaker turns, especially for individuals with speech disorders. Here, we propose a split of the data that allows a comparative evaluation of different diarization methods. We designed and trained end-to-end neural network architectures to tackle this task directly from the raw signal, and we evaluated each approach under the same metric. We also studied the effect of fine-tuning models to find the best performance. Experimental results are reported on naturalistic clinical conversations between psychologists and interviewees at different stages of Huntington's disease, displaying a large panel of speech disorders. We found that our best end-to-end model achieved 19.5% IER (identification error rate) on the test set, compared to 23.6% achieved by fine-tuning the X-vector architecture. Finally, we observed that we could extract clinical markers directly from the automatic systems, highlighting the clinical relevance of our methods.
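The IER figures above can be illustrated with a simplified frame-level computation. Real diarization scoring (e.g. in pyannote.metrics) operates on timed segments with collars and overlap handling, so this is only a sketch of the idea: misses, false alarms, and speaker confusions, normalized by the amount of reference speech.

```python
def frame_ier(reference, hypothesis):
    """Simplified frame-level identification error rate: frames where the
    hypothesis label disagrees with the reference (a miss, a false alarm,
    or a speaker confusion), divided by the number of reference speech
    frames. None marks non-speech frames."""
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    speech = sum(r is not None for r in reference)
    return errors / speech

ref = ["A", "A", "A", None, "B", "B", "B", "B"]   # ground-truth speakers
hyp = ["A", "A", "B", None, None, "B", "B", "A"]  # system output
print(f"{frame_ier(ref, hyp):.1%}")  # 2 confusions + 1 miss over 7 speech frames -> 42.9%
```

Lower is better; the 19.5% vs 23.6% comparison in the abstract is this kind of error rate computed over the full test conversations.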
Speaker detection in the wild: Lessons learned from JSALT 2019
Submitted to ICASSP 2020. This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions, from meetings to speech in the wild. We describe the research threads we explored and a set of modules that proved successful in these scenarios. The ultimate goal was to explore speaker detection, but our first finding was that effective diarization improves detection, and omitting the diarization stage degrades performance. All the different configurations of our research agree on this fact and follow a main backbone that includes diarization as a previous stage. With this backbone, we analyzed the following problems: voice activity detection, how to deal with noisy signals, domain mismatch, how to improve the clustering, and the overall impact of the previous stages on final speaker detection. In this paper, we show partial results for speaker diarization to give a better understanding of the problem, and we present the final results for speaker detection.